-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial implementation to add external health monitor controller and agent #3
Initial implementation to add external health monitor controller and agent #3
Conversation
Ignore this PR first please. It is far away from completion Just a WIP PR to show progress. |
86e741f
to
4bcef6a
Compare
e92257d
to
e20cae9
Compare
af812e9
to
e353fb1
Compare
/test continuous-integration/travis-ci/pr |
e353fb1
to
e437396
Compare
os.Exit(1) | ||
} | ||
|
||
supportGetVolume, err := supportGetVolume(ctx, csiConn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should check LIST_VOLUMES support first. If both LIST_VOLUMES and VOLUME_CONDITION are supported, we should use that to retrieve volume health. This will have much better performance than making individual calls to GetVolume.
If LIST_VOLUMES is not support, then we check the combination of GET_VOLUME and VOLUME_CONDITION.
} | ||
|
||
// TODO: move this to csi-lib-utils | ||
// TODO: do we need to consider ControllerServiceCapability_RPC_LIST_VOLUMES_VOLUME_HEALTH ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, I added a comment above that LIST_VOLUMES should be checked and it should be the preferred way to retrieve information in batch.
Also you may be using an earlier version of the spec? LIST_VOLUMES_VOLUME_HEALTH is removed. Now we just have one VOLUME_CONDITION which is used for both LIST_VOLUMES and GET_VOLUME.
showVersion = flag.Bool("version", false, "Show version.") | ||
timeout = flag.Duration("timeout", 15*time.Second, "Timeout for waiting for attaching or detaching the volume.") | ||
workerThreads = flag.Uint("worker-threads", 10, "Number of pv monitor worker threads") | ||
kubeletRootPath = flag.String("kubelet-root-path", "/var/lib/kubelet", "The root path pf kubelet.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/pf/of
monitorInterval = flag.Duration("monitor-interval", 1*time.Minute, "Interval for controller to check volumes health condition.") | ||
|
||
kubeconfig = flag.String("kubeconfig", "", "Absolute path to the kubeconfig file. Required only when running out of cluster.") | ||
resync = flag.Duration("resync", 10*time.Minute, "Resync interval of the controller.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add retryIntervalStart and retryIntervalMax?
metricsManager.SetDriverName(storageDriver) | ||
metricsManager.StartMetricsEndpoint(*metricsAddress, *metricsPath) | ||
|
||
supportGetNodeVolumeHealth, err := supportGetNodeVolumeHealth(ctx, csiConn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's renamed to VOLUME_CONDITION in CSI spec, so change to supportNodeGetVolumeCondition
os.Exit(1) | ||
} | ||
if !supportGetNodeVolumeHealth { | ||
klog.V(2).Infof("CSI driver does not support VolumeHealth Node Capability, exiting") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VolumeCondition
continue | ||
} | ||
t := rpc.GetType() | ||
if t == csi.NodeServiceCapability_RPC_GET_VOLUME_STATS { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NodeServiceCapability_RPC_VOLUME_CONDITION
|
||
metricsAddress = flag.String("metrics-address", "", "The TCP network address where the prometheus metrics endpoint will listen (example: `:8080`). The default is empty string, which means metrics endpoint is disabled.") | ||
metricsPath = flag.String("metrics-path", "/metrics", "The HTTP path where prometheus metrics will be exposed. Default is `/metrics`.") | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add retryIntervalStart and retryIntervalMax?
} | ||
|
||
// TODO: move this to csi-lib-utils | ||
func supportGetVolumeCondition(ctx context.Context, csiConn *grpc.ClientConn) (supportGetVolumeHealth bool, err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
supportControllerGetVolumeCondition
Add one for ListVolumes
os.Exit(1) | ||
} | ||
|
||
supportGetVolume, err := supportGetVolume(ctx, csiConn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
supportControllerGetVolume
} | ||
|
||
// TODO: move this to csi-lib-utils | ||
func supportGetVolume(ctx context.Context, csiConn *grpc.ClientConn) (supportGetVolume bool, err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
supportControllerGetVolume
- apiGroups: [""] | ||
resources: ["pods"] | ||
verbs: ["get", "list", "watch"] | ||
#Secret permission is optional. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need secrets yet for get and list.
mountPath: /var/lib/csi/sockets/pluginproxy/ | ||
|
||
- name: mock-driver | ||
image: quay.io/k8scsi/mock-driver:canary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you going to submit a PR in csi-test to implement volume health in the mock-driver? Otherwise we can't test this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will test this on my local env first, and can send a PR in csi-test if needed
- apiGroups: [""] | ||
resources: ["pods"] | ||
verbs: ["get", "list", "watch"] | ||
#Secret permission is optional. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need Secrets for this.
go.mod
Outdated
go 1.12 | ||
|
||
require ( | ||
github.com/container-storage-interface/spec v1.2.0-rc1.0.20200415022618-e129a75169c1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this have the latest changes we need for VolumeCondition? Should we use master for now?
go.mod
Outdated
golang.org/x/tools v0.0.0-20190524140312-2c0ae7006135 // indirect | ||
google.golang.org/grpc v1.28.0 | ||
honnef.co/go/tools v0.0.0-20190523083050-ea95bdfd59fc // indirect | ||
k8s.io/api v0.17.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update everything to v0.18.0
// TODO: support inline csi volumes | ||
for _, volume := range pod.Spec.Volumes { | ||
if volume.PersistentVolumeClaim != nil { | ||
pvc, err := agent.pvcLister.PersistentVolumeClaims(pod.Namespace).Get(volume.PersistentVolumeClaim.ClaimName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think with v0.18.0, you'll need to have a context and options in all these API calls as well.
@@ -0,0 +1,65 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a sub-folder "pv-monitor-controller" under "controller"? Should the files just be directly under "controller"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just in case we want to add new controllers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll change the directory structure when there's a need for new controllers.
For now, please remove extra layer of directory,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, done
@@ -0,0 +1,74 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a sub-folder "pv-monitor-agent" under "agent"? Should the files just be directly under "agent"?
volumeListAndAddInterval = flag.Duration("volume-list-add-interval", 5*time.Minute, "Time interval for listing volumes and add them to queue") | ||
nodeListAndAddInterval = flag.Duration("node-list-add-interval", 5*time.Minute, "Time interval for listing nodess and add them to queue") | ||
workerThreads = flag.Uint("worker-threads", 10, "Number of pv monitor worker threads") | ||
enableNodeWatcher = flag.Bool("enable-node-watcher", false, "whether we want to enable node watcher, node watcher will only have effects on local PVs now, it may be useful for block storages too, will take this into account later.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it may be useful for block storages too, will take this into account later
"block storage"? Do you mean "remote storage"?
If this only works for local pv now, how do you verify and stop user from enabling this feature if it is not local pv?
showVersion = flag.Bool("version", false, "Show version.") | ||
timeout = flag.Duration("timeout", 15*time.Second, "Timeout for waiting for attaching or detaching the volume.") | ||
listVolumesInterval = flag.Duration("list-volumes-interval", 5*time.Minute, "Time interval for calling ListVolumes RPC to check volumes' health condition") | ||
volumeListAndAddInterval = flag.Duration("volume-list-add-interval", 5*time.Minute, "Time interval for listing volumes and add them to queue") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This option is used when ListVolumes is not supported? Should it be called GetVolumeAndAddInterval instead?
<-stopCh | ||
} | ||
|
||
func (agent *PVMonitorAgent) atLeastOneFieldIsEmpty(podWithPVItem *PodWithPVItem) bool { | ||
func (agent *PVMonitorAgent) isPodWithPVItemInValid(podWithPVItem *PodWithPVItem) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/InValid/Invalid
} | ||
|
||
klog.V(4).Infof("Started PV processing %q", podWithPV.pvName) | ||
klog.V(6).Infof("Started PV processing %q", podWithPV.pvName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually use V(5) for debugging. Can you change V(6) to V(5)?
After we merged the mock-driver PR, I try to deploy external-health-monitor in my personal cluster(v1.15.3). I found it cannot work normally. I just found two bugs because the controller and agent always were panic after they started for a while
I will try to modify it in my own experimental env first for testing. |
After I changed, I think agent and controller can work fine
[root@experiment-lab001-xuran:~/workspace/csi-test/external-health-monitor]# kubectl get pods
NAME READY STATUS RESTARTS AGE
csi-external-health-monitor-agent-mcxqn 2/2 Running 0 4m8s
csi-external-health-monitor-agent-sjcgm 2/2 Running 0 3m58s
csi-external-health-monitor-controller-6bddc8c88c-6ngcq 2/2 Running 0 111s
csi-external-health-monitor-controller-6bddc8c88c-bvm89 2/2 Running 0 104s
csi-external-health-monitor-controller-6bddc8c88c-ls26g 2/2 Running 0 2m1s |
Thanks @fengzixu for testing and fixing the bugs! Hi @NickrenREN, do you want to fix these bugs on your PR or do you want @fengzixu to submit a PR on top of yours? |
In addition to bugs, I mentioned above, I finished the basic test of external-health-monitor using the csi-test. So, let me summarize the testing report as below. Test Cases
Cluster Info
What I did
ResultCase1I run // List
gRPCCall: {"Method":"/csi.v1.Controller/ListVolumes","Request":{},"Response":{"entries":[{"volume":{"capacity_bytes":107374182400,"volume_id":"1","volume_context":{"name":"Mock Volume 1"}},"status":{"published_node_ids":["io.kubernetes.storage.mock"],"volume_condition":{}}},{"volume":{"capacity_bytes":107374182400,"volume_id":"2","volume_context":{"name":"Mock Volume 2"}},"status":{"published_node_ids":["io.kubernetes.storage.mock"],"volume_condition":{}}},{"volume":{"capacity_bytes":107374182400,"volume_id":"3","volume_context":{"name":"Mock Volume 3"}},"status":{"published_node_ids":["io.kubernetes.storage.mock"],"volume_condition":{}}}]},"Error":"","FullError":null}
// Get
gRPCCall: {"Method":"/csi.v1.Controller/ControllerGetVolume","Request":{"volume_id":"1"},"Response":{"volume":{"capacity_bytes":107374182400,"volume_id":"1","volume_context":{"name":"Mock Volume 1"}},"status":{"published_node_ids":["io.kubernetes.storage.mock"],"volume_condition":{}}},"Error":"","FullError":null} Case2Add some temporary code to test a simple negative case // CheckControllerListVolumeStatuses checks volumes health condition by ListVolumes
func (checker *PVHealthConditionChecker) CheckControllerListVolumeStatuses() error {
ctx, cancel := context.WithTimeout(context.Background(), checker.timeout)
defer cancel()
result, err := checker.csiPVHandler.ControllerListVolumeConditions(ctx)
if err != nil {
return err
}
// Positive Case
_, err = checker.csiPVHandler.ControllerGetVolumeCondition(ctx, "1")
if err != nil {
return err
}
// Negative Case
_, err = checker.csiPVHandler.ControllerGetVolumeCondition(ctx, "Mock Volume 10")
if err != nil {
return err
} Result // Positive
gRPCCall: {"Method":"/csi.v1.Controller/ControllerGetVolume","Request":{"volume_id":"1"},"Response":{"volume":{"capacity_bytes":107374182400,"volume_id":"1","volume_context":{"name":"Mock Volume 1"}},"status":{"published_node_ids":["io.kubernetes.storage.mock"],"volume_condition":{}}},"Error":"","FullError":null}
// Negative
gRPCCall: {"Method":"/csi.v1.Controller/ControllerGetVolume","Request":{"volume_id":"Mock Volume 10"},"Response":{"status":{"volume_condition":{"abnormal":true,"message":"volume not found"}}},"Error":"rpc error: code = NotFound desc = Mock Volume 10","FullError":{"code":5,"message":"Mock Volume 10"}} Conclustion
As I said before, this PR of csi-test kubernetes-csi/csi-test#268 only for testing the basic function of external-health-monitor. Just checking if it can invoke the What we will doAfter we merged this PR, I think we can do some tasks asap
|
@NickrenREN @xing-yang If you are available, please check my comment |
I fixed bugs that cause panic though submit the PR to @NickrenREN 's repository Maybe @NickrenREN can merge it and update this PR |
@fengzixu thanks a lot for doing this. |
serviceAccount: csi-external-health-monitor-controller | ||
containers: | ||
- name: csi-external-health-monitor-controller | ||
image: quay.io/k8scsi/csi-external-health-monitor-controller:canary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are going to move the image repo to a different place so images for this repo won't be here any more.
To avoid confusion, please change this to:
image:
We can update this after this PR is merged and all necessary release-tools changes are merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we merge this PR now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still reviewing this.
I forget one patch about the clusterrole of Event resource. I think controller needs it. So I fixed it. |
@NickrenREN Can you squash the commits? Looks good to me other than that. I'll merge after you've squashed them. |
@fengzixu, If Nick merges your PR into his branch, it will be merged together with this PR. Otherwise you can submit a separate PR so it can be merged after this PR is merged. |
1. remove the leader-election from argument list of agent 2. assign correct value to pvcListerSynced
2961012
to
76fc3fd
Compare
@xing-yang done |
Sure |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: NickrenREN, xing-yang The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR adds an external health monitor controller and an external health monitor agent.
Which issue(s) this PR fixes:
Fixes #1
Fixes #2
Does this PR introduce a user-facing change?: